Abstract


Process migration is the relocation of a process from one machine to another
during its execution. It is a mechanism for utilizing idle computing power that
would otherwise be wasted in a distributed system. In this thesis, a distributed
process subsystem that provides migration facilities is developed on SunOS. File
maintenance, maintenance of process relationships and a load balancing policy are
developed for migration. The memory management used for this system is taken from
the rfork model, which was already implemented.
Chapter 1
Introduction 


Migration is a promising concept in distributed systems through which valuable
computational power can be utilized. Migration of a process is the transportation
of a running process from one machine to another. The most common objective of
process migration is an even distribution of load over all machines in a network.

1.1 Objective 

Our primary goal is to provide migration facilities useful for load balancing. Apart
from this main objective, certain guidelines have been followed, as far as
possible, in developing the model.

• Transparency: After migration, a process should behave in the same manner
as before migration. Additionally, the process should see the outside world
as before, and the outside world should see the process as before.

• Compatibility: The approach should be compatible with Unix semantics
and should be achieved with minimal modifications to the Unix kernel. Until the
model stabilizes, it should support conventional Unix semantics.

Our approach should give maximal support to processes whose computation is
high and whose interaction with the outside world is low. At the same time, it
should provide, as far as possible, an environment for processes whose
interaction with the outside world is complex.


1.2 Method followed 

We provide migration facilities by means of a few system calls and a migration
server mechanism on top of the SunOS kernel. Load balancing is provided by
user-level software and a load daemon that calculates the dynamic load on the
system.


1.3 Motivation 

Typical workstations show a low average CPU utilization and many idle intervals.
Especially late at night, a significant proportion of the workstations on a network
are unused or lightly loaded. These workstations have spare processing capacity
that can be used by users who are logged in at other workstations and find the
computing capability at their own sites insufficient. By using the unused computing
power at other sites, high system utilization can be achieved.

1.4 Organization of the report 

The thesis has been organized as follows. In the next chapter, we present the 
related work. In chapter three, we explore the issues related to process migration. 
In chapter four, we provide the details about design. In chapter five, we present 
our implementation issues. Finally in the last chapter, we conclude the thesis and 
discuss some future work. 
Chapter 2
Background 


In this chapter, concepts of distributed systems and related work are presented. 

Two advances in technology, namely the development of very powerful integrated
circuits and the availability of high speed networks, led to the existence of
computing systems composed of a large number of CPUs.

2.1 Systems with multiple CPUs 

These systems can be characterized as follows.

• Multiprocessor system consists of tightly coupled hardware. 

• Distributed system consists of loosely coupled hardware, but tightly coupled 
software. 

• Network system consists of loosely coupled hardware and loosely coupled 
software. 

In loosely coupled hardware, there are autonomous computers that have no
shared memory and communicate through messages over a network. These are also
known as NORMA (No Remote Memory Access) systems. In tightly coupled hardware
systems, many CPUs share the same physical memory through their address
spaces. The data transfer rate is low in loosely coupled hardware. In a tightly
coupled software system, software provides the view of a single virtual machine,
even though the hardware is loosely coupled.

The only difference between distributed systems and network systems is in the
accessibility of resources through software. In network systems, users access
resources explicitly. Distributed systems create the illusion for users that the
entire network is a virtual uniprocessor. In a distributed system, software
conceals the underlying hardware.


2.2 Definition of distributed System 

A distributed system is a collection of independent computers that appear to the
users of the system as a single computer system [Tanen95].

This definition implies two aspects of the system. 

• Hardware component: Autonomous computers linked by a network. 

• Software component: Distributed system software equipped at every site, 
that provides interface to other sites. This software is responsible for providing 
the look of a single computer to the user. 

The term "distributed system" is applied to any software system working on a
network of computers, in which resources are not localized to any particular
computer and yet are treated uniformly at all computers. In other words, the
underlying hardware is transparent.

2.3 Advantages of distributed systems 

There are many advantages of distributed systems over conventional centralized 
systems. 

• Resource sharing: When a facility is not available at a machine, a dis- 
tributed system facilitates the use of the same facility, if it is available at some 
other site. For example, in a load balancing distributed system, idle computing 
resources are engaged with the work. 
• Transparency: Users need not identify the machines where printers, files
etc. are located. Users can access them irrespective of their position in a
distributed system.

• Fault tolerance: Distributed systems can withstand crashes and failures of 
system components by maintaining redundancy and recovery. 

• Parallelism: Many machines can be used at the same time for a single task,
so a degree of parallelism can be achieved economically.

• Flexibility: A distributed system may undergo changes and it should be
flexible enough to accommodate them according to future demands. Micro kernels
are designed with this motive. They provide minimal kernel support and leave
many services to user level servers. Flexibility is of two types.

Openness: Flexibility to replace or modify any component of system. 

Scalability: Flexibility to add new component to system. 

Here component is either software or hardware component. 


2.4 Distributed computing 

As the name implies, distributed computing is the utilization of the computing
resources on many machines. Recently, it has become an attractive topic due to its
capability to substitute for parallel computing. The fact is that buying smaller
computers and networking them is cheaper than purchasing a massively parallel
processor. The objectives behind distributed computing are parallelism and load
balancing. PVM [Gei93] is an example of a software interface that exploits
parallelism across a network of workstations, and REM [Shoja87] is an example of
a load balancing system.

Distributed computing can be either at the granularity of tasks (inter-task
computing) or at the granularity of subtasks within a task (intra-task computing).
Inter-task computation is preferable for distributed systems due to its low
communication and synchronization overhead. Intra-task computation is more
suitable for multiprocessors. Developing a distributed process subsystem leads to
distributed computing of the former type.

There are two approaches to distributed computing. 

• The user has to specify tasks or subtasks. Strictly speaking, this approach is
network computation. PVM [Gei93] is an example of such a system.

• The system selects subtasks or different tasks that can be distributed. For
example, Amoeba [MuSo90] is an operating system where tasks are distributed by
the OS.

2.5 Impact of Unix model on distributed OS

In the past, many distributed operating systems have been developed. Sprite [DoOu89], Mach [A
Chorus [Coul91] and Amoeba [MuSo90] are a few such examples. Many of these
operating systems are not compatible with the Unix operating system. Unix is a
widely accepted system, but it has the drawback that it is not a distributed
one. To understand its effect on other operating systems, consider the following.

While discussing Mach and its compatibility with Unix, Black et al. [Bla95] say,

no matter how novel its features, how elegant its design or how extensible 
its structure, it could succeed only if its Unix emulation was as good as 
or better than the native Unix on every platform on which it ran. 

So distributed software that is compatible with Unix is useful. One of the
reasons Mach has moved ahead of the others is that it can emulate Unix and
provide its environment.

Another approach is developing distributed extensions to Unix, instead of
developing distributed operating systems. The interprocess communication
facilities for a network provided by BSD 4.3 [Leff89] make this approach feasible.
The file subsystem is the only subsystem that has been implemented successfully in
a distributed manner in Unix. Other subsystems have not yet been implemented
completely. Even in the file subsystem, complete transparency is not achieved. In
this thesis also, an attempt is made to develop distributed layers on top of Unix
subsystems.

2.6 Components of distributed OS 

Though one can not enumerate subsystems of a distributed system, as new compo- 
nents are always being added and it should be open and extensible, there are some 
major and minimal parts of OS that can be listed. Some major components are as 
follows. 

• Process subsystem consists of process creation, scheduling and maintenance of 
processes. 

• Interprocess communication, mutual exclusion. 

• Memory subsystem consists of memory allocation, memory protection and man- 
agement. 

• File subsystem consists of file management and file access facilities. 

• I/O subsystem, peripheral device handling. 

• User authentication, access control and network security. 

• Uniform naming scheme in network. 

These components are however not completely independent of each other. 

Some examples of implementation of distributed subsystems are as follows. 

• Amoeba’s run server model [MuSo90] 

• Andrew File system [Coul94]

• Sun NFS [Tanen95, RFC1094] 

• Munin’s distributed shared memory [Coul94] 

• Ivy's working page owner method [Coul94]

• Kerberos authentication system [SNS88, Stev92]

• Domain name service [Coul94] 

• Mach's message-memory duality model [RYT87]. 


2.7 Work related to migration 

Several migration facilities for distributed systems have been developed. Petri
et al. [PeLi95] say,

Due to the complex nature of the subject, all those facilities have limi- 
tations that make them usable for only limited cases of applications and 
environments. 

Locus [WaPo83], Charlotte [ArFi89], V [Cher88], Sprite [DoOu89], MDX [Sch95]
and MOSIX [Smit88] are operating systems that provide migration. Only Locus
is Unix compatible among these. Another credit to it is the maintenance of pipe
semantics using a network wide file system. Charlotte is the first to divide
migration into two layers. Sprite achieved high distributed functionality except
for devices, which are not known across machines. It has its own file system, but
has the drawback that a migrated process depends heavily on its original machine.
An attempt to reduce memory transfer time is made in the V kernel. It implements
pre-copying, a technique similar to prepaging. MDX is a completely object oriented
implementation. The MOSIX kernel contains one more interface layer, in addition
to the Charlotte layers.

Condor [LiMa92] is successful among the user level implementations. It wraps
the main() routine and system calls of a C program and maintains dummy processes.
It uses the setjmp() and longjmp() calls in Unix to get its memory image. Freedman
[Freed91] is another user level system. It supports only memory transfer. The model
given in [PeLi95] is similar to Condor, but it uses the debugger interface provided
by Unix.

Micro-kernel based migration is another line of research. Zayas [Zaya87] has
done this on the Accent operating system, a predecessor of Mach [Acc86]. Mach based
migration is another interesting work, since the OS is open and extensible. Lazy
copying is a technique in Mach that is helpful for migration. Actor based migration
can be done in Chorus [Smit88].

Mach-like operating systems are more suitable for adding distributed subsystems
like migration, since Mach is a message based OS. In a message based OS, processes
interact by sending messages. By redirecting messages, new components of the OS can
be added or modified. In procedure oriented OSes like Unix and Sprite, interaction
is by system calls, which are subroutines executing in kernel mode. Modifications
to the OS require changes to these subroutines. Moreover, Unix has a monolithic
kernel and it is therefore difficult to incorporate these changes.

Apart from the lack of complete Unix semantics, all the migration facilities suffer
from one serious problem: the lack of an efficient load distribution algorithm.
This is an interesting area of research. The most cost-effective algorithm is to
send a probe to a randomly chosen destination and decide based on the reply
[Tanen95].

2.8 Conclusion 

Unix is designed for centralized systems and it satisfies all the requirements in
such environments. However, it is difficult to modify it according to emerging
demands. It is not a distributed OS. Its structure is monolithic and it is not a
message passing OS. Still, all the research revolves around Unix, since it has
gained popularity with both users and researchers and is simple and sufficient for
most current demands. It is also important to note that no other OS has satisfied
the needs of such a vast community up to now. So, we need facilities like migration
to be available with Unix.
Chapter 3


Process Migration 


Process migration is relocating the process during its execution. It is not the migra- 
tion of code. It consists of maintenance and transfer of process state. Process state 
is composed of memory image of process, contents of registers and program counter, 
usage of resources, and communication to the outside world. For transferring a 
process from one machine to another, there are some phases. 

3.1 Phases of migration

When a process migrates from machine M1 to machine M2, it undergoes the
following stages.

a) Suspension of process on M1.

b) Check pointing it, i.e. collecting snapshot of process on M1.

c) Transfer of snapshot to M2.

d) Initializing system data structures on M2 and reestablishing communication 
with outside world. 

e) Restarting it on M2. 

Usually, phase b and phase c are performed in parallel to eliminate the need
for buffering. However, in fault tolerant systems [Sri91], these cannot be
done in parallel because there the checkpoint file is prepared periodically even if
transfer is not needed.
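
The five phases can be illustrated with a small user-level simulation. This is only a sketch under stated assumptions: the Process and Machine classes, the migrate function and the use of JSON as the snapshot format are hypothetical names chosen for illustration, and they do not describe the kernel-level mechanism developed in this thesis.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Process:
    """Toy process state: a program counter plus some memory (hypothetical)."""
    pid: int
    pc: int = 0
    memory: dict = field(default_factory=dict)


@dataclass
class Machine:
    """Toy machine holding a process table and a set of suspended pids."""
    name: str
    procs: dict = field(default_factory=dict)
    suspended: set = field(default_factory=set)


def migrate(pid, src, dst):
    src.suspended.add(pid)                          # a) suspend on source
    snapshot = json.dumps(asdict(src.procs[pid]))   # b) checkpoint (snapshot)
    wire = snapshot.encode("utf-8")                 # c) transfer (bytes on the wire)
    restored = Process(**json.loads(wire.decode("utf-8")))
    dst.procs[restored.pid] = restored              # d) set up structures on target
    del src.procs[pid]                              # the source copy is discarded
    src.suspended.discard(pid)
    return restored                                 # e) process can now run on target
```

In this sketch phases b and c run one after the other, so the whole snapshot is buffered in memory; performing them in parallel, as described above, would stream the snapshot instead.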


3.2 Inherent problems with migration 

Process migration has some inherent bottlenecks. There is a time lag between 
suspension and restarting, during which the process is not active. Thus it can not 
respond to outside events and therefore demands the queuing of essential events. 

Moreover, a process faces some unexpected events due to its reincarnation. For
example, when a file is successfully opened by a process, it exists throughout its
execution. But the file cannot be seen after migration, if another process deletes
it during the migration.

Another problem is that the process may not be able to access resources in the
same way as it does on the original machine. This is due to subsystems which are
not distributed in nature. Pipes, semaphores etc. are designed with a uniprocessor
system in mind. They are not suited to a distributed system without major
modifications.

Another problem is fragility. All parts of the OS have to be considered while
introducing migration facilities. Even after introduction, every subsystem has to
consider its effect on migration, whenever it needs modifications.


3.3 Migration vs. remote execution 

There are some other approaches that can be used in place of migration. Rollback
and shadow paging are alternatives used in fault tolerant systems. We will not
discuss them, as they are out of scope for this work.

Remote execution is an alternative approach to load balancing. Butler [Nich87]
and REM [Shoja87] are examples of systems employing remote execution.
With experience from Butler [Nich87], a load balancing system based on remote
execution, Nichols noted that the addition of a migration facility would make his
system convenient.

Migration has certain advantages over remote execution. Remotely executed
programs run entirely on the remote machine, whereas migration may occur at any
time during the execution. Migrated programs run partially on a machine until the
migration and may even come back to the original machine after a few migrations.
Generally, programs are invoked explicitly for remote execution, whereas users
remain unaware of process migration. Butler is however an exception to this.

Computation intensive jobs, i.e. jobs requiring low interaction, are large in
number and incur no additional overhead due to migration.


3.4 Migration mechanism 

Migration mechanism is divided into strategy and policy.

1) Migration policy: It concerns design decisions like when to migrate and
where to migrate. It covers the selection of an idle machine, the time of
migration and eviction.

2) Migration strategy: It consists of the implementation of check pointing,
transfer, setup and maintenance.

Such a layering offers certain advantages, like code modularity. For example,
the same bottom layer can be used for different policies in the upper layer, like
load balancing and resource availability.

3.5 Migration policy 

Policy consists of the following decisions.

1) Transfer policy: A decision is to be made whether some process needs
migration or not. It answers the question, "When to migrate?".

2) Location policy: Once the transfer policy has decided to get rid of a process,
the location policy has to figure out "Where to migrate?". It involves selecting
a suitable host for migration based on the load and other factors.

3) Selection policy: More specifically, this is the job selection policy. It
answers the question, "What to migrate?", and selects the task to migrate.
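
The three decisions can be sketched as three small functions. This is an illustrative sketch and not the policy implemented in this thesis: the function names, the load values, the probe limit and the choice of "most CPU time consumed" as the selection criterion are all assumptions. The location policy below uses random probing, in the spirit of the probe-based algorithm mentioned in Chapter 2.

```python
import random


def transfer_policy(local_load, threshold):
    """When to migrate: only if the local load exceeds the threshold."""
    return local_load > threshold


def location_policy(loads, threshold, probe_limit=3, rng=None):
    """Where to migrate: probe a few randomly chosen hosts and accept the
    first one whose load is below the threshold; give up (None) otherwise.

    `loads` maps host name -> current load and stands in for probe replies.
    """
    rng = rng or random.Random()
    hosts = list(loads)
    for host in rng.sample(hosts, min(probe_limit, len(hosts))):
        if loads[host] < threshold:
            return host
    return None


def selection_policy(cpu_time_by_pid):
    """What to migrate: here, the process that has consumed the most CPU
    time, on the assumption that it is computation intensive."""
    if not cpu_time_by_pid:
        return None
    return max(cpu_time_by_pid, key=cpu_time_by_pid.get)
```

Keeping the three decisions separate mirrors the layering described above: any one of them can be replaced without touching the others or the underlying migration strategy.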

3.5.1 Policy characteristics 

1) Localized vs globalized: A policy can consider information only at the local
site or can depend on information at all sites in the processing pool.

2) Centralized vs distributed vs hierarchical decision making: If decisions
over the network are taken at a single machine, then it is a centralized approach.
If decision making authority is distributed over all sites, then it is a
distributed approach. One compromise is dividing the network into groups and
subgroups and giving authority over them to one site per group or subgroup.
Subgroup authorities are subordinate to the group authority. This is called
hierarchical decision making.

3) Static vs adaptive: If a policy changes according to the current state, then
it is an adaptive or dynamic policy, otherwise it is a static policy. For example,
a dynamic policy based on the system’s load will implement the following strategy.

If the system’s load is more than the threshold for the recent intervals, it
increments the threshold.

4) Receiver oriented vs sender oriented approach: These approaches differ
in which side takes the initiative for migration. In the receiver oriented
approach, a machine that is idle or has plenty of resources announces its
availability, informing others that it has little to do and is ready for extra
work. In the sender oriented approach, a machine that is overloaded or lacks
resources requests another machine to take its work.

5) Heuristic vs deterministic: A policy has some intent, like load sharing, in
migrating a process. If decisions made by a policy will certainly lead to an
improvement with respect to the intent, then it is called a deterministic policy.
Such decisions are very difficult unless the future demands for resources are
known a priori. Since future demands of processes are not known a priori, a
heuristics based approach is used.
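
The adaptive strategy described under "Static vs adaptive" can be written down in a few lines. The sketch below is an assumption-laden illustration: the class name, the window of three intervals and the step of 0.5 are hypothetical values, since the text does not fix how many recent intervals are considered or by how much the threshold is incremented.

```python
from collections import deque


class AdaptiveThreshold:
    """Raise the load threshold when the load has exceeded it for every
    one of the last `window` sampling intervals."""

    def __init__(self, threshold, window=3, step=0.5):
        self.threshold = threshold
        self.step = step
        self.recent = deque(maxlen=window)   # loads of the recent intervals

    def observe(self, load):
        """Record one interval's load and return the current threshold."""
        self.recent.append(load)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(l > self.threshold for l in self.recent):
            self.threshold += self.step
            self.recent.clear()              # start a fresh observation window
        return self.threshold
```

A static policy corresponds to never calling observe, so that the threshold keeps its initial value.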
3.6 Migration strategy

Migration strategy needs to consider the following issues. 

1) Residual dependencies: The complete state is big and, if migrated in totality,
will consume large system resources. Thus only the part of the state that is
absolutely necessary for running the process on the other machine is migrated. This
type of state slicing leaves residual dependencies. A good migration strategy should
reduce the volume of residual dependencies, as they have to be communicated
between machines.

2) Memory: The memory image of a process can be transferred either by freezing or
on demand.

Freezing: By transferring the entire image at once, residual dependencies
are minimized at the cost of a large overhead in initial setup.

On demand: In this approach, the image is copied from one machine to
another only when there is a need. This method is similar to copy on write
in Mach or lazy swapping. Many times, it saves the cost of transferring unused
segments of memory, but comes with the additional overhead of determining
at runtime whether memory needs to be copied or not.

3) Files: The migrated process needs to access the same files in the same way
as it did before migration. There are many approaches to maintaining the
state of files after migration.

Transferring entire state to destination: In this approach, before
restarting the process, the entire file information has to be received and
reestablished. [PeLi95] followed this approach.

Dummy (shadow) processes: Here a process is created on the original
machine that does file operations on behalf of the migrated process. The migrated
process redirects its file operations to the dummy one and receives results from
there. Condor [LiMa92] followed this approach.

State-full file servers: For achieving higher transparency, state is
maintained at the file server, unlike SUN NFS. During migration, the server will be
informed of the state of file handles. Using this type of approach, file pointer
sharing across machines can be achieved. The Sprite operating system [DoOu89]
achieved this with its own file system.

4) Shared memory: Distributed shared memory has to be implemented in a
manner independent of the machine. Munin’s system and Ivy are two such
successful systems.

5) Mutual exclusion: There are many algorithms developed for distributed
synchronization and mutual exclusion [Meak87], e.g. Maekawa, Rangwaala,
central server and ring based algorithms.

6) Communication: Message handling is a complicated task, because during
migration some messages may be half way through transfer. There are three
possible approaches to handling messages [ChLu89].

Message Redirection: If process p on M1 has a channel c to communicate
with some other process on machine M2 before migration and it migrates
to machine M3, then it opens channel d from machine M3 to M1, which is
associated with channel c between machines M1 and M2. M2 is unaware
of p’s migration; it uses channel c as previously. Machine M1 redirects it to
channel d for M3. The scheme is transparent for M2, but each communication
passes through an extra hop.

Message loss recovery: After migration, M2 is informed about p’s
migration, and it then establishes a new channel c' between M2 and M3. All
the messages in transit during migration are recovered from c on M2 and sent
on c'. The channel c is then closed.

Message loss prevention: In this approach, M2 is informed of the migration
before the migration of process p. M2 then queues messages on channel c
till the migration is completed. After migration M2 is again informed by M1.
Channel c' is then established in place of c.
7) Authentication: As many resources are shared among machines, authentication
is required for each new connection and authorization has to be verified
for every request. Kerberos [SNS88] is an example of authentication services.

8) Process relationships: It is desirable to have the same parent and
children after migration. This can be achieved by modifying the process table
entries to give a unified view.

9) Signals: There is no way of sending signals to processes on other machines
except by messages. So signals have to be redirected to the process’s current
location.

10) Naming: Resources have to be named uniquely on all machines and should
be identified as the same resources before and after migration.
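
Of the three message handling approaches listed under Communication, message loss prevention is the easiest to show in code. The following is a minimal sketch, assuming in-memory stand-ins for channels; the Channel and Peer classes and their method names are hypothetical and do not correspond to any interface of this thesis or of [ChLu89].

```python
from collections import deque


class Channel:
    """Toy one-way channel that records delivered messages in order."""

    def __init__(self, name):
        self.name = name
        self.delivered = []

    def send(self, msg):
        self.delivered.append(msg)


class Peer:
    """The process on M2 that communicates with the migrating process p."""

    def __init__(self, channel):
        self.channel = channel        # channel c before migration
        self.migrating = False
        self.pending = deque()        # messages held back during migration

    def send(self, msg):
        if self.migrating:
            self.pending.append(msg)  # queue instead of sending on c
        else:
            self.channel.send(msg)

    def migration_started(self):
        """M2 is informed before p migrates."""
        self.migrating = True

    def migration_finished(self, new_channel):
        """M2 is informed again; channel c' replaces c."""
        self.channel = new_channel
        self.migrating = False
        while self.pending:           # flush queued messages in order
            self.channel.send(self.pending.popleft())
```

Because nothing is sent on the old channel while the process is in transit, no message can be lost there, at the price of delaying the peer’s messages until migration completes.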
Chapter 4
Design issues 


Before discussing the design, the factors influencing our design are given.


4.1 Influencing factors 

4.1.1 Working environment 

We implemented our system using SUN 3 workstations in the IITK computer science
department, connected by Ethernet in a local area network. Three of them have disks
and the others are diskless workstations.

Some features of our environment:

• All workstations are running autonomously. 

• They are based on Motorola 68020 32-bit microprocessors providing a 32 bit
address space.

• They run SunOS, which is an enhanced version of 4.2 BSD and 4.3 BSD [Leff89]
Unix systems, with some features from AT&T’s System V.3 UNIX [Bach91].

• No source code is available for the SunOS kernel, but it is configurable. The
makefile to generate vmunix is available to provide support for driver
development, thus the kernel can be modified.

• The init_sysent.c file is available, thus we can add system calls. Wide
networking support is available.

• There are many utilities like adb, kadb and nl that are useful for working with
the kernel executable file.


4.2 SunOS environment 

SunOS is an enhanced version of BSD Unix. It differs from standard Unix in some
aspects. We describe here the Sun process address space, kernel data structures and
other related information.

4.2.1 SunOS process address space

The SunOS process address space is divided into logical segments called regions.
Each region is a contiguous area of virtual address space within the process image.
Segments can be shared or protected across processes. Each process has at least
three regions, namely text, data and stack. The text segment contains the machine
instructions that form the executable code. This segment is read only and neither
grows nor shrinks. The data segment contains the storage for program variables,
strings, arrays and other data. It has two parts, initialized data and
uninitialized data (bss). Data segments are modifiable. The third segment is the
stack segment, which grows or shrinks as subroutines are called.

SunOS supports the shared library concept. Two processes using the same library
code can map its segments at run time using dynamic linkage. One possible address
space layout for a SunOS process is shown in Figure 1.

4.2.2 Implementation of processes 

Processes are active entities in SunOS. Process state is shown in Figure 2. Every 
process has a user part and a kernel part. When system calls are invoked, kernel 
part of the invoking process becomes active and gets executed. The kernel maintains 
two key data structures related to processes, process table and user structure. 
Each SunOS process has an entry in the kernel process table that points to a
process structure. The process structure contains process state that must reside in
memory even when the process is swapped out. Various fields are as follows:

• Identification ids: Various ids like pid, uid, ppid, euid, gid, pgid.

• Scheduling parameters: Process priority, amount of cpu time consumed, nice
value etc.

• Memory image: Pointers to process region table entries of text, data and stack.

• Signals: Masks showing signals being ignored, being caught, being blocked etc.

• Miscellaneous: Events being waited for, alarm details, pointers to rusage
structures, pointers to other related processes in process table.

User structure, also called the u-area, contains process information that is not
needed when the process is not physically in memory and runnable. It includes the
following information:

• File descriptor table: Descriptors for file related system calls. 

• System call state: Arguments and return values of current system calls.

• Process control block: Hardware specific.

• Kernel stack.

• Register context.

• Pointers to process structure and to different vnodes.

• Timing and statistics related information.

There are some other important structures to be maintained for each process.

• User credentials, ucred. 

• User resource usage structure, rusage. 

• Per process page tables. 

• File table information. 
4.2.3 Process’s life line

Various stages during the lifetime of a process are as shown in Figure 3.



Figure 3: State transition of a process during its life 

Every process other than process 0 is created by a call to the fork system call by
its parent process. At birth, the child process inherits all state from the parent.
To overlay its state with a new state in order to perform some other task, it calls
the exec system call. At the end of its execution or in the case of abnormal
conditions, it calls the exit system call. After exit, the resources occupied by the
child are released and it executes no further. But its process table slot is kept
until its parent calls the wait system call. This state is called zombie. After the
wait system call, the child’s status is given to the parent and its process table
slot is also released. If the parent dies without waiting for its children, the
children become orphans. Process 1, init, becomes their parent process and waits
for them.
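
The fork, exit and wait part of this life line can be observed from user level on any Unix system. The short sketch below uses Python’s os module, which wraps the corresponding system calls; the function name run_child_and_reap is a hypothetical helper, and the exec step is left out to keep the example self-contained.

```python
import os


def run_child_and_reap(exit_code):
    """Fork a child that exits at once, then collect its status with wait."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately with the given status. os._exit is
        # used so that the child does not run the parent's cleanup code.
        os._exit(exit_code)
    # Parent: once the child has exited, it remains a zombie (its process
    # table slot is kept) until the parent waits for it here.
    reaped_pid, status = os.waitpid(pid, 0)
    assert reaped_pid == pid
    return os.WEXITSTATUS(status)
```

The value returned by the parent is the status the child passed to exit, which shows that the child’s status survives in the process table slot until wait releases it.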

4.2.4 File maintenance 

On each machine, the state of all open files is maintained in a global table. Every
file opened by a process is associated with a descriptor. System calls like create
prepare an entry in the global file table, obtain a pointer to it and store it in
the file descriptor table, a part of the u-area. System calls that change the state
of an opened file use that file descriptor. Reference to a file from a file
descriptor is shown in Figure 4.

Figure 4: Referring to a file from a file descriptor

File is an abstract entity in Unix. The file structure consists of a generic field fdata 
and some generic operations. If the file type is vnode, then fdata is a pointer to a 
vnode structure. If it is a socket, then fdata is a pointer to a socket structure. 

In the vnode structure there is again a type field, which distinguishes devices from regular 
files. In the socket structure there are again generic fields, the protocol control block 
and the protocol switch structure. These abstractions continue down to the device driver 
layers. 
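
The layering described above can be pictured with a tagged structure. This is only a sketch: all type and field names carrying the suffix _sketch, and the helper is_regular_file, are hypothetical simplifications and do not reproduce the SunOS kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum ftype_sketch { FT_VNODE, FT_SOCKET };
enum vtype_sketch { VT_REGULAR, VT_DEVICE };

struct vnode_sketch  { enum vtype_sketch v_type; };
struct socket_sketch { int so_proto; };

struct file_sketch {
    enum ftype_sketch f_type;   /* selects how fdata is interpreted */
    void *fdata;                /* points to a vnode_sketch or socket_sketch */
    long  f_offset;             /* current offset into the file */
};

/* Return 1 if the file entry refers to a regular file, else 0. */
int is_regular_file(const struct file_sketch *fp)
{
    assert(fp != NULL && fp->fdata != NULL);
    if (fp->f_type != FT_VNODE)
        return 0;               /* sockets are never regular files */
    return ((const struct vnode_sketch *)fp->fdata)->v_type == VT_REGULAR;
}
```

A test of this kind matters for migration, because only entries that resolve to regular files can simply be reopened by name on another machine.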


4.3 Design details 

4.3.1 Naming 

A uniform naming scheme is needed so that resources are treated the same on all machines. 
A name has to carry details such as the home machine. Processes have to be identified 
both locally and across all systems. One immediate answer is to identify them with the pair 
(machine, pid), or simply mpid. But there are some problems with this scheme. When a 
process with pid p1 on machine M1 migrates to machine M2 and gets identified 
as the process with pid p2, this information needs to be propagated to all other 
processes related to it. Otherwise these processes may end up communicating 
with a newly created process with pid p1 on machine M1, assuming it to be the old 
process itself. Keeping the old slot allocated in a special state like zombie, so that 
its pid is not reused, is a solution, but it wastes those slots. 

The reason for this situation is that the same mpid is used for more than one process 
at different points in time. A pid may be reallocated, but time that has passed 
does not recur. Hence time is the best choice for any unique naming scheme. The 
same principle is used for vnode generation: a vnode name is a combination of inode, 
generation time and machine name. Since only one process can be created on a machine 
at any one time instant, the fields (machine, generation time) are sufficient. 
We call this uq_pid. But a mechanism to map these uq_pids to Unix pids 
is needed. One method is to maintain a mapping store on the home machine of each 
process, but this store has to be changed after each migration. Another method 
is to maintain a global process table, either at a central server or by distributing 
the store and its management; it has to be updated properly whenever there are 
changes. 
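
A minimal sketch of such a name and of the mapping store follows. The structure layouts, the linear table and the function names are assumptions made for illustration; the thesis only fixes that the name consists of machine and generation time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of a unique process name. */
struct uqpid_sketch {
    unsigned long machine;   /* inet address of the home machine */
    unsigned long gen_time;  /* creation time stamp on that machine */
};

/* One entry of a hypothetical mapping store. */
struct map_entry_sketch {
    struct uqpid_sketch id;
    int local_pid;           /* Unix pid currently backing the process */
};

/* Two names denote the same process only if both fields agree. */
int uqpid_equal(struct uqpid_sketch a, struct uqpid_sketch b)
{
    return a.machine == b.machine && a.gen_time == b.gen_time;
}

/* Linear lookup; returns the local pid, or -1 if the name is unknown. */
int uqpid_to_pid(const struct map_entry_sketch *tab, size_t n,
                 struct uqpid_sketch id)
{
    size_t i;

    assert(tab != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (uqpid_equal(tab[i].id, id))
            return tab[i].local_pid;
    return -1;
}
```

Because the generation time differs, a new process that reuses an old Unix pid on the same machine still receives a distinct name.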

4.3.2 Communication paradigm 

There are three paradigms for designing distributed systems. 

• Client server model: Servers manage resources and clients are resource 
users. Servers wait passively for requests from clients. When a client contacts 
a server, a connection is established and the communication protocol proceeds. 

• Object oriented model: Resources are treated as objects and resource usage 
is implemented as methods on object. Underlying communication is hidden in 
object implementation and user gets the feeling of accessing local object. Mach 
follows similar approach. Since each resource is treated in a similar manner 
and identified with a port, resource access is done by sending messages to its 
port. 

• Agent oriented computing: In the previous models, processing, i.e. computation, 
is done locally on data that arrives in messages from the 
site where the resource is located. In this model, the computation is carried in messages 
and done at the resource’s location. Programs travel through the network based on 
their need for resources. Traveling agents are one example. This is different from 
RPC [BiNe84], since in RPC we send only the arguments of a computation and get 
results back, but the code does not travel across the network. 

We followed the first method, since it is directly available with either sockets 
or RPC. We selected stream sockets for our application, because with them extra 
framing, reliability checking, duplicate detection etc. are not necessary. 


4.3.3 Software structure 

We have divided our software into two layers: migration policy and migration strategy. 

Kernel level vs user level: A modified kernel is more transparent than any user 
level software in accessing system resources. With a modified kernel, old executables 
continue to work. Moreover, we can access the complete state of a process through the proc 
and user structures, whereas user level software cannot access the complete state of the 
system, e.g. socket buffers. The major disadvantage of kernel modification is that it is not 
portable without changes. A newer technique for user level software is to use the debugger 
interface ptrace(). This technique is used in [PeLi95], but it incurs high overhead due 
to additional context switches. Source code that can be used as a starting point for 
our work is available from a previous M. Tech thesis [Parik92], and the part of that code that 
is useful to us is at kernel level. For this reason, and because of the advantages above, 
we decided to provide migration at kernel level. We did, however, implement the load 
balancing policy at user level. 

The migration strategy consists of system calls such as pm_fork(), 
pm_migrate(), pm_receive(), pm_exit(), pm_wait(), pm_getpid() and pm_getppid(). 
All these system calls interact with the process migration servers, of which there is one 
per machine. Their actual implementation is explained in the next chapter. The server 
code is responsible for accessing resources at the other site. 

4.3.4 Load balancing 

Our policy is to balance load on all machines listed in our processing pool. There 
are many measures of load. Some are given below. 

• CPU utilization 

• Average run queue length 

• Memory available 

• Disk operations 

• System call rate 

• Context switch rate 

Average run queue length is the best estimate of load. If it is high, there is a high 
probability that all the other parameters are high too. Every measure has a threshold 
that identifies high load on a machine. Methods to identify the load on a machine and 
useful guidelines are given in [SPT]. 

4.3.4.1 Details of policy 

Policy decisions can be categorised as follows. 

• Transfer policy: If the current load on a machine is more than the threshold 
value, then migration to other machines will be tried. 

• Location policy: An idle host is selected based on the same criterion, i.e. 
if the load on a remote host is less than the threshold, then that host is selected. 
Hosts are tried in round robin order, so as not to swamp any one host. 

• Selection policy: Jobs to be migrated are of one of two types, 
namely, 

1 Local jobs which have not yet started execution. 

2 Foreign jobs earlier migrated to this machine. This will be the case when 
load becomes more than the threshold. 
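
The location policy can be sketched as a round robin scan over the pool. The function name pick_host, the array of loads and the cursor argument are hypothetical; how the loads are obtained is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Scan the pool in round robin order, starting just after *cursor, and
 * return the index of the first host whose load is below the threshold,
 * or -1 if there is none. On success *cursor is moved to the chosen host,
 * so the next search starts after it and does not swamp the same host. */
int pick_host(const double *load, size_t n, double threshold, size_t *cursor)
{
    size_t i;

    assert(load != NULL && cursor != NULL && n > 0);
    for (i = 1; i <= n; i++) {
        size_t h = (*cursor + i) % n;
        if (load[h] < threshold) {
            *cursor = h;
            return (int)h;
        }
    }
    return -1;                  /* every host is at or above the threshold */
}
```

With loads {0.5, 3.0, 0.2} and a threshold of 1.0, successive calls alternate between the two lightly loaded hosts and never select the busy one.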

T0 - Initial threshold value 
I - Migration allowed 
II - Migration allowed and parameters will be changed 
III - Parameters will be changed 

Figure 5: Schematic view of policy 

4.3.4.2 Policy characteristics 

Our transfer policy is purely local: it depends only upon the load on the current 
host. The selection policy is also local. But the location policy is not local, and cannot 
be. Decision making takes place at all sites; each site makes decisions for the jobs running 
there. Our policy is sender initiated, i.e. the sender requests a receiver to lessen the 
sender’s load. 

Our policy is not deterministic. It only tries to lessen the load on some 
machines; it does not strictly balance load across all sites. Hence, load sharing is a better 
term for our policy than load balancing. The policy is dynamic: the decision parameters 
change based on the loads on hosts, and the threshold changes based on the load at 
all sites. 



Figure 6: Loadmin vs. loadthreshold 

4.3.4.3 Policy parameters and their interrelationship 

• Load threshold: This is the crucial parameter for decision making and is set 
initially to a well analyzed value. If the load at a site is more than this, migration 
decisions take place. Its value changes based on LoadMin. 

• LoadMin: Minimum of the current loads at all sites in the pool. 

• SelfLoad: Load on the current host. 

• Slowdown factor (a): Changing the load threshold hastily can lead to problems 
such as oscillation of the threshold or too much burden on migration and eviction. 
The slowdown factor should reduce such meandering as far as possible. 

• Initial threshold (T0): It is not a variable, but a pre-estimated constant. 
It is used in decision making as given below. 

Load balancing decisions: 

If (SelfLoad is less than LoadThreshold), 
    no migration at all. 
If (SelfLoad > LoadThreshold and LoadMin <= LoadThreshold), 
    do migration to the node corresponding to LoadMin. 
If (SelfLoad > LoadThreshold and LoadMin > LoadThreshold), 
    no migration will take place. 

If (LoadMin <= LoadThreshold and LoadThreshold > T0), 
    LoadThreshold -= (LoadThreshold - LoadMin) * a 
If (LoadMin > LoadThreshold), 
    LoadThreshold += (LoadMin - LoadThreshold) * a 

A schematic view of the policy is shown in Figure 5. In regions I and II, migration is 
helpful. In regions II and III, the parameters are changed based on their current values. 

The effect of the above decisions is shown in Figure 6 for one possible load distribution 
curve. 
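
The decisions of this section can be sketched in a few lines of C. The sketch rests on one interpretation that should be read as an assumption: each update moves LoadThreshold a fraction a of the way toward LoadMin, so the threshold rises when every host is above it and falls back when it exceeds T0. The structure and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical container for the policy parameters. */
struct policy_sketch {
    double threshold;   /* LoadThreshold, adapted over time */
    double t0;          /* initial threshold T0, a constant */
    double a;           /* slowdown factor, assumed 0 < a <= 1 */
};

/* Returns 1 if a job should be migrated to the host holding load_min. */
int should_migrate(const struct policy_sketch *p, double self_load,
                   double load_min)
{
    return self_load > p->threshold && load_min <= p->threshold;
}

/* Assumed update rule: step a fraction `a` of the way toward load_min. */
void adapt_threshold(struct policy_sketch *p, double load_min)
{
    assert(p->a > 0.0 && p->a <= 1.0);
    if (load_min <= p->threshold && p->threshold > p->t0)
        p->threshold -= (p->threshold - load_min) * p->a;   /* region II */
    else if (load_min > p->threshold)
        p->threshold += (load_min - p->threshold) * p->a;   /* region III */
    /* region I: threshold at or below T0 with an idle host, no change */
}
```

A small slowdown factor makes the threshold follow LoadMin slowly, which is what damps oscillation; a value of 1 would make the threshold jump to LoadMin in a single step.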


4.3.5 Mechanics of migration 

Migration is performed by system calls with the help of the migration server, migd. 
When a user program calls the invocation routine of the migrate system call, the C library 
puts the syscall number on the stack and issues a trap to the operating system. The system 
call handler does the usual validation and checking of arguments, establishes a connection 
to the migration server, which is already waiting in passive mode, and then sends request 
packets to the migration server. 

In response to these requests, the server forks a child, sets up the environment for 
migration if necessary, and then calls a system call that takes the process image from the 
remote machine. This code first performs the authentication and informs the remote machine 
that the migration is accepted. 

The entire execution state is then transferred from client to server. The execution state 
is re-established there and execution resumes there. The server migd 
forks and calls pm_receive on receiving a pm_request. 

The migrate system call is used for migrating a task to another machine. The system call 
pmrcv is used by the migd server to respond to the migrate system call. Getpmpage 
and putpmpage have a similar association. Pmopen and pmcreate are used to open a 
file in place of the old calls open and creat. Close and dup are modified so that they have 
the same effect after migration as before. Pmexit and pmwait maintain the 
old semantics of exit and wait for migrated processes. The pmpage transfer routines are 
used for maintaining the global process table. 

4.3.6 File maintenance 

The dummy process approach is the most costly, as it involves more communication between 
machines. The file servers available in our lab use SUN NFS and do not maintain the state 
of open files. In our approach, therefore, we use the following methodology. At 
migration time, file states are transferred to the peer. An NFS file is identified 
by the same name on all machines. Hence, we need a mapping from file descriptor to 
file name. Each file descriptor maintains an offset into a file. After each read and 
write, the file offset is changed accordingly. File sharing by two descriptors is done by 
maintaining a common offset for them in the global file table. 

File sharing between processes is called inter process file sharing. It arises from the 
fork system call. Inter process file sharing is not manageable without maintaining 
state in the global file tables; a stateful file server is a solution for this problem. Intra 
process sharing is between descriptors of the same process. It arises from the dup 
system call. Intra process sharing can be handled by maintaining auxiliary information 
for each process. 

4.3.7 Memory transfer 

This is not developed in this thesis, but has been taken from Parikh’s implementation 
[Parik92], since it is already implemented and meets our purpose. 

This was originally developed for the rfork model [Parik92] to execute programs 
in parallel using a network. One implication of this model is that a process checkpoints 
its own address space, and hence migration has to be initiated by the 
process itself. 

4.3.8 Process table maintenance 

Process tables are needed at all sites. With the central server method, the server is a 
critical point of failure. The distributed shared memory method is therefore selected for 
maintaining these tables. 

The memory occupied by the process tables is divided into logical pages. 

A page is owned by a single host and physically exists in its memory. All other 
hosts know the ownership relation for all pages. Thus an explicit synchronization 
mechanism is not needed, since page ownership can serve as a token to access a 
page. 

Initial page allocation: 

• Method 1: Each machine is given ownership of the pages for which 
p % N = m, where p is the page no, N is the total number of machines in the 
system and m is the machine address between 0 and N - 1. 

• Method 2: A single machine is given ownership of all pages. 

Owner links: A machine may not have ownership of some pages, but it remembers 
an owner for every page. This may not be correct and is only a hint 
for finding the exact owner, so this field is called probable owner (probowner). This hint 
field is updated in the following cases. 

• When M1 transfers page p to M2, M1 updates probowner(p) = M2. 

• When M2 requests M1 for page p and M1 has no ownership of the page, M1 informs 
M2 of its probowner(p), and also passes the request to its own probowner(p). 

• When M1 requests M2 for page p and M2 informs M1 that its owner is M3, 
then M1 updates probowner(p) = M3. 

For example, the initial links and the links after A gets the page are shown in Figure 7. 

An optimization: Initially no kernel memory is allocated for pages. Owners 
allocate memory only after a request, but they maintain a flag for each page. 
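
The owner links can be simulated in memory to see how a request converges on the owner. This sketch makes two simplifying assumptions: only one page is tracked, and every machine contacted during a request, including the old owner, re-points its hint at the requester. All names are hypothetical.

```c
#include <assert.h>

#define NMACH_SKETCH 4

/* prob[m] is machine m's hint for the owner of one page.
 * The real owner is the machine whose hint points to itself. */
struct page_links_sketch { int prob[NMACH_SKETCH]; };

/* Method 1 of initial allocation: page p belongs to machine p % N. */
int initial_owner(int page, int nmach)
{
    assert(nmach > 0 && page >= 0);
    return page % nmach;
}

/* Machine `req` acquires the page by following the hints.
 * Returns the number of machines that had to be contacted. */
int acquire_page(struct page_links_sketch *l, int req)
{
    int hops = 0;
    int cur;

    assert(req >= 0 && req < NMACH_SKETCH);
    cur = l->prob[req];
    while (cur != req) {
        int next = l->prob[cur];
        hops++;
        l->prob[cur] = req;     /* contacted machine now points at req */
        if (next == cur)        /* cur was the owner: page is handed over */
            break;
        cur = next;
    }
    l->prob[req] = req;         /* the requester is the new owner */
    return hops;
}
```

Starting from the chain 0 -> 1 -> 2 -> 3 with machine 3 as owner, a request from machine 0 contacts three machines; a later request from machine 2 then reaches the new owner in a single step.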

Chapter 5 
Implementation 

5.1 Data structure for process migration 

The kernel needs to maintain information about a migrated process so that it behaves like 
a non migrated process after migration. This information needs to be made available 
either in the proc structure or in the u-area. Since the SunOS source is not available, extra 
fields cannot be appended. However, the u-area has a field u.u_xxx[2] that is not 
used for any purpose. Hence, the state required for migration can be kept using this 
field. A structure named pm_struct_per_process is maintained for each process and a 
pointer to this structure is stored in this u-area field. 


struct pm_struct_per_process { 
    int    pm_home;          /* inet address of the home machine */ 
    u_long pm_source;        /* inet address of the source machine */ 
    u_long pm_host;          /* inet address of the host machine */ 
    u_long pm_uqpid;         /* unique process_id */ 
    char  *pm_fmap[NOFILE];  /* descriptors to name map */ 
    short  pm_dmap[NOFILE];  /* file table entry for process */ 
    int    pm_stat;          /* status of process */ 
}; 

Figure 8: Dup chains 


5.2 File handling 

In our approach, we store the name of each file in pm_fmap after it is opened and 
handle intra-process sharing of files by remembering extra links. 

After each dup system call, the pm_dmap chain is modified. 

For example, when dup(1, 3) and dup(3, 4) are executed, the dup chains are modified 
as shown in Figure 8. 
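
One way to keep such chains is a circular list threaded through an array indexed by descriptor. This representation is an assumption made for illustration, since the exact encoding of pm_dmap is not spelled out here; the names dmap_init, dmap_dup and dmap_chain_len are hypothetical.

```c
#include <assert.h>

#define NOFILE_SKETCH 16

/* dmap[fd] is the next descriptor sharing fd's file table entry.
 * A descriptor that shares with nobody points to itself. */
void dmap_init(short dmap[NOFILE_SKETCH])
{
    int fd;

    for (fd = 0; fd < NOFILE_SKETCH; fd++)
        dmap[fd] = (short)fd;
}

/* Record dup(oldfd, newfd): splice newfd into oldfd's chain.
 * newfd is assumed not to be part of another chain already. */
void dmap_dup(short dmap[NOFILE_SKETCH], int oldfd, int newfd)
{
    assert(oldfd >= 0 && oldfd < NOFILE_SKETCH);
    assert(newfd >= 0 && newfd < NOFILE_SKETCH && newfd != oldfd);
    dmap[newfd] = dmap[oldfd];
    dmap[oldfd] = (short)newfd;
}

/* Count the descriptors on fd's chain, including fd itself. */
int dmap_chain_len(const short dmap[NOFILE_SKETCH], int fd)
{
    int n = 1;
    int cur;

    for (cur = dmap[fd]; cur != fd; cur = dmap[cur])
        n++;
    return n;
}
```

After dup(1, 3) and dup(3, 4) the chain runs 1 -> 3 -> 4 -> 1, so one walk starting from any of the three descriptors finds all descriptors that must share an offset after migration.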

The approach adopted for re-establishing the state of files after migration is given below. 

Client’s role: The client sends some file related information while it is passing pmstruct 
to the server, as given in the structure. 

In addition, for each opened file it also sends 

• file descriptor 

• file offset and type from global file table. 

The server recreates the files using the received information as follows. 

• For each file descriptor received, receive all file related information (file offset 
and type). 

• For each file per dup chain, open the file in the required mode, if its type is regular file. 

• Seek to the required offset. 

• Dup it to the fd where it was before migration and close the current fd. 

• Follow the dup chain and dup to all file descriptors in the chain. 
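
At user level the open, seek and dup steps above map onto the standard open, lseek and dup2 calls. The following is a sketch only; the function restore_fd and its parameters are hypothetical, and the thesis performs these steps inside the kernel.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reopen `name`, seek to `offset` and make the file appear at descriptor
 * `fd`, where it was before migration. Returns fd, or -1 on failure. */
int restore_fd(const char *name, int flags, off_t offset, int fd)
{
    int tmp;

    assert(name != NULL && fd >= 0);
    tmp = open(name, flags);            /* same name on every NFS client */
    if (tmp < 0)
        return -1;
    if (lseek(tmp, offset, SEEK_SET) == (off_t)-1) {
        close(tmp);
        return -1;
    }
    if (tmp != fd) {
        if (dup2(tmp, fd) < 0) {        /* move to the original slot */
            close(tmp);
            return -1;
        }
        close(tmp);                     /* drop the temporary descriptor */
    }
    return fd;
}
```

The remaining descriptors of a dup chain can then be recreated with further dup2(fd, other) calls; since dup2 shares one file table entry, they also share the restored offset.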

5.3 Process table maintenance 

Processes that are not using migration facilities continue to work and are 
maintained through the local process table as before. But processes that do migrate 
are maintained by using another, global process table, in addition to the table at 
each site. The global table is maintained in shared memory. This table occupies 
one or more pages. A process accesses them by using the get_pm_page and put_pm_page 
routines. Circular waiting of machines for these pages is avoided by accessing them 
in order. The get_pm_page and put_pm_page routines handle page ownership and 
transfers. 

5.3.1 Control abstraction of get_pm_page 

input : pmpage /* page no to be accessed */ 

1. While (our machine is not owner for pmpage) { 
       Open a socket in tcp/ip mode. 
       Connect to current prob owner site. 
       Send the page no. 
       If reply is positive, get the page and break. 
       Else update prob owner from reply. 
   } 

2. If our machine is the owner and memory was not allocated, allocate memory 
   for it. 

3. Return the page. 

5.3.2 Control abstraction of put_pm_page 

Read the requested page no on connected socket. 

If owner, send the page. 

Else send the prob owner name. 

Change probable owner to the client’s machine. 

Each migrated process corresponds to a slot in the global process table. 

A structure called pm_proc is similar to proc and contains fields pointing to the 
structures of related processes. In Unix, all parent processes are also of the same type, 
but here a parent process is either a migrated one or a conventional Unix process. 
Hence, the parent pointer points either into pm_procs or to Unix_parent structures that are 
also maintained in shared pages across machines. 

Suppose a parent dies without waiting for a child. In Unix the child then becomes an orphan 
and is attached to process 1, i.e. the init process. But a separate approach is 
followed here in handling orphans, because of the difficulty of injecting code into init. All 
routines that insert, search or modify slots are similar to the corresponding routines 
for proc structure maintenance. But the delete routine is different from the corresponding 
one. It is needed as part of the wait system call. 

5.3.3 Control abstraction of deleting slot 

input: slot p; 

• 1. Put all of its children’s state to orphan. 

• 2. For all slots corresponding to zombie children, call this routine recursively. 

• 3. Modify the related links in other slots. 

• 4. Return the exit status. 

5.3.4 Control abstraction of exit routine 

input: slot status; 

• 1. If the process in that slot is orphan, delete the slot. 

• 2. Otherwise, put it in zombie state. 

In order to purge the slots of orphans, the exit handler calls the delete routine if the 
exiting process is already an orphan. 
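
The interplay of the delete and exit routines can be sketched on a small in-memory table. The slot layout, the state names and the table size are assumptions; in particular this sketch orphans only the children that are still running and purges zombie children directly, which is one reading of the steps above.

```c
#include <assert.h>

#define NSLOT_SKETCH 8

enum slot_state { SLOT_FREE, SLOT_ACTIVE, SLOT_ORPHAN, SLOT_ZOMBIE };

/* Hypothetical, much reduced stand-in for a pm_proc slot. */
struct pm_slot_sketch {
    enum slot_state state;
    int parent;             /* index of the parent slot, -1 if none */
    int status;             /* exit status, valid once zombie */
};

/* Delete slot p: orphan its running children, purge its zombie children
 * recursively, free the slot and return its exit status. */
int slot_delete(struct pm_slot_sketch *t, int p)
{
    int c, status;

    assert(p >= 0 && p < NSLOT_SKETCH && t[p].state != SLOT_FREE);
    for (c = 0; c < NSLOT_SKETCH; c++) {
        if (c == p || t[c].state == SLOT_FREE || t[c].parent != p)
            continue;
        if (t[c].state == SLOT_ZOMBIE) {
            slot_delete(t, c);          /* nobody is left to wait for it */
        } else {
            t[c].state = SLOT_ORPHAN;
            t[c].parent = -1;           /* unlink from the dying parent */
        }
    }
    status = t[p].status;
    t[p].state = SLOT_FREE;
    t[p].parent = -1;
    return status;
}

/* Exit handler: an orphan frees its own slot, others become zombies. */
void slot_exit(struct pm_slot_sketch *t, int p, int status)
{
    assert(p >= 0 && p < NSLOT_SKETCH);
    t[p].status = status;
    if (t[p].state == SLOT_ORPHAN)
        slot_delete(t, p);
    else
        t[p].state = SLOT_ZOMBIE;
}
```

Because an orphan releases its slot in its own exit handler, no process has to play the role that init plays in conventional Unix.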

5.4 Load balancing 

Rstatd is a daemon that runs on top of RPC, one per machine. It provides 
kernel statistics. Whenever a decision has to be made, RPC client 
handles are created and used to contact the rstatds on all machines listed in a 
pool file, which contains the list of all machines allowed as migration targets. The 
algorithm that implements the policy, as explained in the previous chapter, is applied 
to the information collected from all of these machines. 

5.5 Migration procedure 

Client calls migrate system call and server communicates with it using pmrcv system 
call. 

Control abstraction of migrate: 

1. Open a tcp/ip socket to server. 

2. Send request and user credentials. Prove authentication. If reply is not positive 
return error. 

3. Send the kernel data structures proc, u, ucred, rusage and pmstruct. 

4. Send all segments of memory image. 

5. Send state of all open files. 

6. Modify the fields in process table accordingly. 

7. Return. 

Control abstraction of pmrcv: 

1. Read request and user credentials, check authentication. 

2. Read the kernel data structures proc, u, ucred, rusage and pmstruct. 

3. Receive all segments of memory image. 

4. Execve with text, data and stack. 

5. Attach other segments. 

6. Read other file related information and reestablish the state. 

7. Return. 
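
The opening exchange of the two abstractions above (request and credentials, then a positive or negative reply) can be sketched over a stream socket as follows. The header layout, the one byte reply and all function names are hypothetical; sending a raw structure also assumes that both hosts share byte order and structure layout.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical request header; the real packet layout is not given. */
struct pm_request_sketch {
    unsigned long uqpid;
    unsigned int  uid;
};

/* A stream socket may deliver fewer bytes than asked for, so loop. */
static int write_all(int fd, const void *buf, size_t n)
{
    const char *p = buf;
    while (n > 0) {
        ssize_t k = write(fd, p, n);
        if (k <= 0)
            return -1;
        p += k;
        n -= (size_t)k;
    }
    return 0;
}

static int read_all(int fd, void *buf, size_t n)
{
    char *p = buf;
    while (n > 0) {
        ssize_t k = read(fd, p, n);
        if (k <= 0)
            return -1;
        p += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Client: send the request header. */
int pm_send_request(int sock, const struct pm_request_sketch *rq)
{
    assert(rq != NULL);
    return write_all(sock, rq, sizeof *rq);
}

/* Server: read the header, accept only allowed_uid, send a one byte reply. */
int pm_serve_request(int sock, struct pm_request_sketch *rq,
                     unsigned int allowed_uid)
{
    char reply;

    assert(rq != NULL);
    if (read_all(sock, rq, sizeof *rq) < 0)
        return -1;
    reply = (rq->uid == allowed_uid) ? 'Y' : 'N';
    if (write_all(sock, &reply, 1) < 0)
        return -1;
    return reply == 'Y' ? 0 : -1;
}

/* Client: wait for the reply; 0 if positive, -1 otherwise. */
int pm_await_reply(int sock)
{
    char reply = 'N';

    if (read_all(sock, &reply, 1) < 0)
        return -1;
    return reply == 'Y' ? 0 : -1;
}
```

Only after a positive reply would the client go on to send the kernel structures, memory segments and file state, each of which can reuse the same read and write loops.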

5.6 Authentication 

An initial authentication is done to check whether a user is allowed to do migration. 
As a further safeguard against a user cheating, the server creates a dummy file with 
restricted permissions, makes that user its owner and asks the client to perform 
operations that are allowed only to the owner. 

5.7 Overview of implementation 

The client program communicates with the server using either the migration protocol or the 
page transfer protocol. Loadd collects information from the rstatd running on the other 
machines and selects a host for migration. This is conveyed to the migration server 
migd. The communication paths are shown in Figure 9. In this figure, channel 3 represents 
the signal flow and channel 5 represents RPC messages, while all others represent 
communication through sockets. 

1 Server and client communicate using migrate() and pmrcv() 

2 Server and client communicate for page transfer 

3 Migd informs client to migrate 

4 Idle host request and reply 

5 Load statistics request and reply 

Figure 9: Communication between various components 

Chapter 6 
Conclusion 


6.1 Work done 

In this thesis, a migration strategy and a migration policy have been designed and 
implemented. The migration strategy ensures that opened files and 
process relationships are preserved after migration. File maintenance is provided by the 
state forwarding approach. Process maintenance is provided by keeping the process tables 
in distributed shared memory. 

These are implemented with a few system calls and a migration server. 

The objective of the migration policy has been to distribute the load over all systems. 
The policy is implemented using a load daemon. The necessary information for the load 
daemon is provided by the migration server. 

This system is compatible with Unix. All computation oriented jobs can be 
migrated. But complex processes, for example those which use sockets, will not run 
in the same fashion after migration. 

6.2 Extensions 

• Instead of using NFS, stateful file server approach can be used for complete 
transparency. 
• Maintenance of Unix abstractions like signals, pipes, sockets and semaphores 
has to be added to the system. 

• Memory maintenance is taken from the rfork model, which requires that the migration 
of a process be initiated by the same process. 

• An alternate approach can be used so that migration is done by another process. 